Abstract

All of the International Space Station (ISS) systems 
which require computer control depend upon the 
hardware and software of the Command and Data 
Handling (C&DH) system, currently a network
of over 30 386-class computers called Multiplexer/
Demultiplexers (MDMs)[18]. The Caution and
Warning System (C&W)[7], a set of software tasks 
that runs on the MDMs, is responsible for detecting, 
classifying, and reporting errors in all ISS subsystems 
including the C&DH. Fault Detection, Isolation and 
Recovery (FDIR) of these errors is typically handled 
with a combination of automatic and human effort. 

We are developing an Advanced Diagnostic System 
(ADS) to augment the C&W system with decision
support tools to aid in root cause analysis as well as
resolve differing human and machine C&DH state
estimates. These tools, which draw on model-based
reasoning[16,29], will improve the speed
and accuracy of flight controllers by reducing the un- 
certainty in C&DH state estimation, allowing for a 
more complete assessment of risk. We have run tests 
with ISS telemetry and focus on those C&W events 
which relate to the C&DH system itself. This paper 
describes our initial results and subsequent plans. 


1. Introduction 

The Aerospace Safety Advisory Panel (ASAP)[3]
identified the C&DH system as a critical system that 
needs continued attention. Even as the ISS complexity 
increases, budget pressures may require scaling back 
of the ISS ground support team and the ISS crew[2]. 
There is a strong need for easy-to-configure automa- 
tion tools which assist the ground and flight teams to 
assess complex situations, suggest recovery options[9] 
and track the response of automated C&DH FDIR. 

The paper begins by presenting an overview of the 
hardware and software of the C&DH system. At ISS 
assembly complete, the C&DH system will be made 
up of over 60 MDMs executing 1 million lines of Ada 
flight software tasks. Each MDM executes the Ada 
tasks according to a rate monotonic schedule (RMS)[5]. 
We highlight the challenging nature of C&DH system 
FDIR, including high rates of false-positive C&W 
events[3] and unexplained hardware and software 
anomalies referred to in Problem Report and Corrective
Action (PRACA) documents[1,8]. (Figs. 1, 2, 3, 4)

We use structural dependency models of the 
C&DH system made up of a network of hardware/software
paths of components which govern both
nominal and off-nominal modes of behavior. The 
components consist of MDMs, their internal boards, 
buses and software structure. This information is de- 
rived from hardware schematics[18], standard out 
(STDOUT)[25], C&W fault trees[19], as well as soft- 
ware source code. 

The models are utilized by both symptom- and 
simulation-based tools. The symptom-based tool 
TEAMS[17,29], from the testability community, overlays
functional signal/test component relationships
over structural-dependency models. The test points are 
evaluated from observations. Diagnosis is determined 
by tracing back from all active test points to those 
components which contain the test point signals. The 
simulation-based tool L2[15,16], from the
model-based diagnosis community, propagates “pieces
of stuff” (material, electricity, heat, information)
through the structural dependency networks in order to 
compare predictions of C&DH behavior against ISS 
C&DH observations. Diagnosis is determined by trac- 
ing back to components whose predictions lead to in- 
consistencies with observations (conflict generation). 
Through a process called candidate generation, L2 
derives a diagnosis by toggling the nominal and 
off-nominal modes of the failed components in order 
to achieve a consistent set of mode assignments to all
components which accounts for all observations (and 
commands). 

By integrating TEAMS and L2, we can close the 
FDIR loop. TEAMS and L2 provide state estimation 
capabilities (at different levels of abstraction), and 
have the ability to continue to perform state estimation 
in the face of partial data dropouts. L2 also provides 
regulation capabilities. TEAMS addresses issues of 
scale-up and speed of diagnosis for the whole C&DH 
system. L2 can automatically select the active paths of 
the structural dependency models either through com- 
mands or inferring reconfiguration. L2 provides capa- 
bilities to explain its reasoning processes. 

We present two scenarios derived from Guidance,
Navigation and Control (GNC) C&DH events which
occurred on days 88-90 of 2002. The first scenario demonstrates
a model-based method of determining
whether two C&W events have a common root cause. 
We map each C&W event to the subset of the ISS
hardware components for which it is responsible. We 
employ set covering methods over these subsets to 
determine potential root causes for C&W events. If the 
data from two C&W events can be accounted for by a 
single component of the model, we assume the C&W 
events have a common root cause. (Figs 5, 6, 7, 8). 

A second scenario addresses the issue that C&W 
events for MDM failure can be ascribed to both 
hardware and software causes. We provide a method to 
trace information dependencies over time between 
Ada tasks and the global shared memory called the
Current Value Table (CVT) (Fig. 10). We accomplish
this by extending the methods of model-based diagno- 
sis (MBD)[15,16] to model software components 
[11,12,13] as well as hardware components. We model 
the timing of Ada tasks and their I/O with the CVT. 
This allows us to shadow the execution of the Ada 
tasks in order to track software errors. When Ada task 
exceptions occur, we will provide software dependency
information to determine which portions of the
CVT to focus on in order to discriminate between competing
root causes of exceptions. (Figs. 10, 11, 12)

We conclude by comparing our approach to other 
tools (CRANS[14], PEM cells[21]) and address issues 
of future work. 

2. Command & Data Handling System (C&DH)
The function of the C&DH system is to provide 
hardware and software to support command and con- 
trol of the ISS, services for flight and ground opera- 
tions, and science payloads. This is achieved through 
a three-tiered network of computers (MDMs), all running
Ada tasks and interconnected by MIL-STD
1553B data buses (Fig. 1). The Tier 1 MDMs run 
Command and Control Software (CCS) that controls 
system-wide functions such as ISS mode. The Tier 2 
MDMs are responsible for subsystem level functions 
for Electrical Power Systems (EPS), Guidance 
Navigation and Control (GNC), Environmental Con- 
trol and Life Support Systems (ECLSS), Thermal 
Control System (TCS) as well as others. Tier 3 MDMs 
interact with the multitude of sensors and effectors 
onboard the ISS. 



Figure 1. The C&DH system consists of a three-tiered 
hierarchy of networked computers and buses. [18] 



Each MDM consists of a power supply and an 
IOCU (Input/Output Control Unit) card that contains 
the 386 SX processor and the 1553 Bus Interface 
Adaptor (BIA) to connect the MDM to upper tier 
MDMs. The MDM can also be configured with up to 
five I/O cards and five 1553 Serial-Parallel-Digital 
(SPD) network cards. Each SPD card connects the
MDM to lower tier MDMs. For
example, a GNC MDM (Fig. 2) has an IOCU with a
BIA connected to upper tier bus CB_GNC_1 as well as
five SPD 1553 cards. Each SPD card is attached to two
buses. For example, the SPD1 card attaches to buses
LB_GNC_2 and LB_TS_1.




Figure 2. The GNC MDM model of an MDM from a 
PCS display [28]. The MDM has a power supply card, 
an IOCU card, and 5 SPD network cards. 


When an MDM is booted up and its software is loaded,
a cyclic scheduler is invoked. The cyclic scheduler is 
responsible for executing Ada tasks according to a rate 
monotonic schedule (RMS)[5]. The C&DH system 
has an internal rate of 80 Hz, with all software running
at 10 Hz, 1 Hz and 0.1 Hz rates. If the task rates are
multiples of each other, a feasible RMS schedule can
more easily be achieved. The Ada tasks communi- 
cate via a global shared memory called the current 
value table (CVT) (Fig. 3). The propagation of com- 
mands and data through the network is accomplished 
through a set of realtime Ada routines for I/O and 1553 
communication where the commands and data cycli- 
cally read from and write to the CVT, in service of 
MDM specific User Application Software (UAS). 

When Ada tasks fail, exceptions are thrown by 
tasks and caught by exception handlers within the dy- 
namic scope of the task (Fig. 4). These exception han- 
dlers respond to a variety of known error conditions 
including invalid data for sensors, bus and command 
data parameters, task overruns and watchdog timer
timeouts. When failures occur, the MDMs usually
transition to the diagnostic state and are taken offline.
Frequently occurring Ada task exceptions are
addressed with counters for classes of exceptions.
For those classes, transition to diagnostic mode occurs
only when an accumulated threshold is crossed.
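
The counter mechanism described above can be sketched as follows. This is a minimal illustration, not the flight software: the class name ExceptionMonitor, the exception class names and the threshold value are hypothetical, and we assume that a class without a counter sends the MDM to diagnostic on its first occurrence.

```python
from collections import Counter


class ExceptionMonitor:
    """Counts exceptions per class and requests diagnostic mode past a threshold."""

    def __init__(self, thresholds):
        # thresholds: exception class name -> tolerated count (hypothetical values)
        self.thresholds = dict(thresholds)
        self.counts = Counter()
        self.state = "operational"

    def report(self, exc_class):
        """Record one exception and return the resulting MDM state."""
        self.counts[exc_class] += 1
        limit = self.thresholds.get(exc_class)
        # Assumption: classes without a counter go to diagnostic immediately.
        if limit is None or self.counts[exc_class] > limit:
            self.state = "diagnostic"
        return self.state


monitor = ExceptionMonitor({"invalid_sensor_data": 3})
states = [monitor.report("invalid_sensor_data") for _ in range(4)]
```

In this sketch the first three reports are tolerated and the fourth crosses the accumulated threshold.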



Figure 3. Ada tasks communicate via a global shared 
memory called the current value table (CVT)[27]. 

Ada tasks can fail for several reasons. The loss of 
hardware components that software routines depend 
upon can cause an Ada task to wait until it times out. 
Overruns can cause it to experience constraint viola- 
tion and divide-by-zero errors. Violating any of the 
assumptions of rate monotonic scheduling (RMS) can 
also cause exceptions. Liu and Layland[5] identify the
assumptions: (1) the requests for all tasks are periodic,
(2) tasks are independent and non-interacting, (3)
execution time for each task is constant, (4) each task
must complete before the next request for it occurs
and (5) task switching is instantaneous. Finally, the
sheer complexity of the C&DH software is at the root
of additional Ada task failures. According to [6], “It is
not possible to achieve 100% test coverage [of the
software] due to the enormous number of permutations
of states in a computer program execution, versus the
time it would take to exercise all those possible states.
Also there is often a large indeterminate number of
environmental variables, too many to completely simulate”.
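
To illustrate why task rates that are multiples of each other make a feasible schedule easier to achieve, the sketch below applies the Liu and Layland utilization bound n(2^(1/n) - 1), which is sufficient but not necessary, and the known result that harmonic task sets are schedulable up to a utilization of 1. The execution times are hypothetical values in milliseconds; only the 10 Hz, 1 Hz and 0.1 Hz periods come from the text.

```python
def liu_layland_bound(n):
    """Sufficient utilization bound n(2^(1/n) - 1) for n periodic tasks."""
    return n * (2 ** (1.0 / n) - 1)


def is_harmonic(periods):
    """True if every period divides the next larger one (integer periods)."""
    ordered = sorted(periods)
    return all(longer % shorter == 0
               for shorter, longer in zip(ordered, ordered[1:]))


def rms_check(tasks):
    """tasks: list of (execution_time, period) pairs in the same integer unit.

    Returns (guaranteed, utilization, bound). A False result for a
    non-harmonic set only means the bound gives no guarantee.
    """
    utilization = sum(c / p for c, p in tasks)
    periods = [p for _, p in tasks]
    bound = 1.0 if is_harmonic(periods) else liu_layland_bound(len(tasks))
    return utilization <= bound, utilization, bound


# Periods of 100, 1000 and 10000 ms; execution times are made up.
harmonic_tasks = [(20, 100), (300, 1000), (4000, 10000)]
ok, utilization, bound = rms_check(harmonic_tasks)
```

With the same total utilization but a non-harmonic period (for example 1500 ms in place of 10000 ms), the check falls back to the three-task bound of roughly 0.78 and no longer guarantees schedulability.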

We will focus on classes of unexplained C&DH 
MDM failures (PRACAs[8] #2593, #3031, #3019[1]). 
These problem reports state: "There is insufficient data
to determine the root cause"[1], due to unknown
hardware/software interactions. The integration of
model-based diagnosis methods with techniques from
program slicing[11,12,13] provides an approach to
combining hardware and software FDIR within a single
framework. We plan for our system to shadow the
execution of the Ada tasks to extend software V&V to
the runtime, operational environment (through the use
of framecount and checkpoint data). When an exception
handler is invoked, we can provide software dependency
information to determine the root cause and
determine which portions of the CVT to focus on to
discriminate between competing fault hypotheses (e.g.
Fig. 10). We will rely upon the MDM Application
Development Environment (MADE)[20] to
emulate suspect Ada tasks and provide insight
into their anomalous behavior.


Cyclic_Task
begin
   task_processing;
exception
   when constraint_error | numeric_error |
        program_error | storage_error |
        task_error =>
      Call Ada Exception Handler;
   when others =>
      Call Default Exception Handler;
end;
... Call task_overrun_handler
end Cyclic_Task

Figure 4. Template of realtime cyclic Ada task [27] 

3. Caution and Warning (C&W) System 
The primary fault detection system of the C&DH 
system is the C&W system[7]. This system executes at 
a 10 Hz rate, cyclically evaluating each of 10,000 
C&W events which characterize the ISS state. For 
the C&DH system two important classes of C&W 
events are MDM failure and bus failure events (Figs.
7, 8). Each C&W event is defined by a fault tree whose
logic reflects a complex AND/OR description of
low-level ISS telemetry parameters as well as the re- 
sults of intermediate trees. The fault trees of the C&W 
events have a limited notion of context, which has 
caused the C&W system to experience an unacceptably
high false-positive rate. C&W events are false
positives for several reasons, including when: 1) sensors
bin in/out of nominal due to incorrect limits, or 2)
operations such as depressurization are performed
which cause sensors to leave their nominal range, or 3)
cascading failures occur, such as a power supply failure
leading to a string of dependent failures. Often, mission controllers
turn off false-positive C&Ws; however, the Aerospace 
Safety Advisory Panel has stated[3]: “Avoid the need 
to inhibit C&W alerts by countering the root causes of 
false alarms.” 
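
As an illustration of how a single fault tree rule is evaluated cyclically, the sketch below encodes the GNC MDM failure rule quoted in the caption of Figure 7 ("frame count has not changed for four seconds and I/O is ok"). It is a simplified reading, not the C&W implementation: we assume one sample per 10 Hz C&W cycle, so four seconds corresponds to 40 consecutive unchanged samples, and the function name and sample layout are hypothetical.

```python
def mdm_failed(samples, rate_hz=10, hold_seconds=4):
    """Return the index of the first cycle at which the event is raised, or None.

    samples: list of (framecount, io_ok) tuples in time order, one per cycle.
    """
    needed = rate_hz * hold_seconds  # cycles the framecount must stay unchanged
    unchanged = 0
    for i in range(1, len(samples)):
        previous_count, _ = samples[i - 1]
        count, io_ok = samples[i]
        unchanged = unchanged + 1 if count == previous_count else 0
        if unchanged >= needed and io_ok:
            return i
    return None


# Framecount advances for five cycles, then stalls while I/O stays ok.
stalled = [(k, True) for k in range(5)] + [(4, True)] * 50
# Healthy MDM: framecount keeps cycling through 0..99.
healthy = [(k % 100, True) for k in range(200)]

raised_at = mdm_failed(stalled)
never_raised = mdm_failed(healthy)
```

A rule of this kind has no notion of why the framecount stopped, which is the limited context the section above identifies as a source of false positives.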

4. Approach 

We seek to address these two classes of problem,
unexplained C&DH PRACAs and unacceptable
false-positive C&W rates, by the use of dependency
tracking tools which can trace C&W events to root 
causes through the hardware/software paths governing 
their operation. We base our approach on the C&DH 
ground handbook which instructs flight controllers to 
respond to MDM failures by dumping a set of onboard 
logs and buffers to augment the standard cyclical telemetry[10].
We rely on the Diagnostic Data Server
(DDS)[4] to parse these logs and temporally organize
the ISS events from cyclic telemetry, log files and
1553 bus messages. We also base our approach on the 
C&DH flight handbook[31], which instructs the astro- 
nauts to respond to a partial loss of an MDM by ad- 
dressing each C&W in the order it was received. Our 
tools will help organize C&W events by root cause. 

5. Are Two C&W Events Related? 

To determine whether any set of C&W events have 
a common root cause, we derive a diagnosis with all 
the data related to the set of C&W events of interest. If 
the minimal diagnosis is a single fault, we assume the 
set of C&W events has a single root cause. This is
accomplished by first pre-computing for each C&W
event the subset of C&DH components
for which it is responsible. Once the component
sets are defined for all C&W events of interest, we
utilize set-covering methods over all the C&W event
component sets. A non-empty intersection is a necessary
but insufficient condition for determining if C&W
events are related. This is because the intersection
of the component sets for each C&W event is a static
analysis of the topologies; it does not take into account the
command and sensor information from parameters in
the C&W fault trees, or the logs dumped to MCC
for analysis, which could further reduce the intersection.
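
The set-covering step can be sketched as a search for the smallest sets of components that intersect every C&W event's component set. The code below is an illustrative brute-force version, not the TEAMS or L2 algorithm. The static component sets follow the simplified GNC example in this paper; the "pruned" sets are a hypothetical stand-in for what telemetry and dump logs might leave after exonerating components.

```python
from itertools import combinations


def minimal_diagnoses(component_sets):
    """Smallest component sets that intersect every event's component set."""
    universe = sorted(set().union(*component_sets.values()))
    for size in range(1, len(universe) + 1):
        hits = [set(candidate) for candidate in combinations(universe, size)
                if all(set(candidate) & comps
                       for comps in component_sets.values())]
        if hits:
            return hits
    return []


# Static analysis: components each C&W event is responsible for.
static_sets = {
    5392: {"SPD1", "RGA", "CMG", "GPS"},
    5014: {"SPD1", "SPD3"},
}
# Hypothetical sets after telemetry and logs have exonerated some components.
pruned_sets = {
    5392: {"CMG"},
    5014: {"SPD3"},
}

static_result = minimal_diagnoses(static_sets)
pruned_result = minimal_diagnoses(pruned_sets)
```

In the static case the only single-component cover is SPD1, which is the necessary condition. Once the sets are reduced, no single component covers both events and the minimal diagnosis becomes a double fault.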

We explore this process with two C&W events re- 
lated to the GNC MDM system which took place on 
days 88 and 90 in 2002. As it turns out, C&Ws 5392
and 5014 are not related; i.e. the minimal diagnosis is
a double fault. We briefly introduce the portions of the
C&DH/GNC topology necessary for this example and 
then step through the process of determining de- 
pendencies between these C&W events. 



Figure 5. Telemetry parameters used by Event I (CW# 
5392) and Event II (CW# 5014). 


The GNC MDM has five network cards (Figs. 2, 6
(Event II)). One of these network cards, SPD1, is
connected to the LB_GNC_2 bus as the bus controller
(BC) (Fig. 6 (Event I)). The BC can send/receive
information via A/B redundant channels to a set of
three remote terminals (RTs) connected to GNC devices
called the rate gyro assembly (RGA), the control
moment gyro (CMG) and the global positioning system
(GPS).



Figure 6. The components involved in Events I and
II have an intersection at network card SPD1.


At 11:00 pm on day 88 of 2002, the 1553 error
count for RT CMG on bus LB_GNC_2 exceeded its
limits (Fig. 5, Event I). Automatic bus FDIR took over
and activated the A/B channel switch to determine if
switching channels would stop the RT CMG 1553 error
messages. It did not, and soon the A/B channel
switch counter exceeded its limits while continually trying to
address RT CMG. This caused Event I: C&W event
5392 (LB_GNC_2 failed) to be raised. Over a day
passed, then the network card SPD3 failed at the same
time as a watchdog timer (WDT) error occurred. These
events caused the GNC MDM to go into diagnostic
mode, which caused the framecount to stop, which
caused Event II: C&W event 5014 (GNC MDM failed)
to be raised.



Figure 7. C&W Event 5014: GNC MDM Fail: “If the
frame count has not changed for four seconds and I/O
is ok then GNC MDM is failed”. [19]


We develop models of an MDM and bus required to
build the structural dependency models. The MDM
model is developed from the connectivity information
in the GNC MDM definition (Figs. 2, 6 (Event II)) as
well as the fault tree for C&W event 5014 (Fig. 7).
We represent both working and ¬working models of
the MDM. The working model states that the MDM is
working provided that the framecount is changing
(dframecount) and that all of its internal components
are nominal: $(dframecount \neq 0) \land SPD1 \land SPD3$.
The ¬working model is derived using De Morgan’s
laws from the working model:
$(dframecount = 0) \lor \lnot SPD1 \lor \lnot SPD3$.
For the sake of this example, we
have only included in the GNC model the components needed
for our scenario (Fig. 9).

The bus model is developed from the bus connectivity
information of the LB_GNC_2 bus (Fig. 6,
Event I) and the fault tree for C&W event 5392 (Fig.
8). The fault tree in Fig. 8 states that the bus is failed if
any of its RTs (CMG, RGA and GPS) or the BC is
failed (see large text in Fig. 8). The ¬working model
of the bus components is developed by reducing the
fault tree propositions to prime implicant form through
the use of a 4-variable Karnaugh map: ¬working:
$\lnot SPD1 \lor \lnot RGA \lor \lnot CMG \lor \lnot GPS$. The working
model is derived by use of De Morgan’s laws: working:
$SPD1 \land RGA \land CMG \land GPS$. We integrate
the MDM and bus models into a structural dependency
model in Figure 9. Infusing this model with telemetry
and dump log information allows us to determine dependencies
between C&W events 5392 and 5014.
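
The working and not-working models of the MDM and bus can be written as Boolean functions and checked exhaustively to confirm that each pair is complementary, as De Morgan's laws require. This is only a consistency check on the simplified example models; the function names are ours, and a component variable is True when that component is nominal.

```python
from itertools import product


def mdm_working(dframecount, spd1, spd3):
    return dframecount != 0 and spd1 and spd3


def mdm_not_working(dframecount, spd1, spd3):
    return dframecount == 0 or not spd1 or not spd3


def bus_working(spd1, rga, cmg, gps):
    return spd1 and rga and cmg and gps


def bus_not_working(spd1, rga, cmg, gps):
    return not spd1 or not rga or not cmg or not gps


booleans = [False, True]

# For every assignment, exactly one of the two models must hold.
mdm_consistent = all(
    mdm_working(d, s1, s3) != mdm_not_working(d, s1, s3)
    for d, s1, s3 in product([0, 1], booleans, booleans)
)
bus_consistent = all(
    bus_working(*values) != bus_not_working(*values)
    for values in product(booleans, repeat=4)
)
```

For instance, a failed CMG with all other components nominal satisfies the bus not-working model, and a stopped framecount alone satisfies the MDM not-working model.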


GNC MDM and LB_GNC_2 Structural Dependency Model



Figure 9. The structural dependency models of components
covered by CW events 5014 and 5392.


6. MDM Simulation HW/SW Using Framecount
A closer look at the data from days 88 and 90 reveals
that when the SPD1 network card failed, a software
watchdog timer (WDT) tripped as well. At present, the
root cause of the WDT trip has not been determined[1].
By modeling the software execution and modification
of memory (CVT) we hope to determine root causes
for such errors. Tracing through software
requires the capability to simulate the software processes
and record their justifications. For example, in
Figure 10, we show with the bold lines information
dependencies between software tasks and shared
memory elements of the CVT. Initially CVT(1,7) and
CVT(2,6) are read by Ada Task 1 at some framecount,
which writes out intermediate data product CVT(2,5).
Ada Task n reads this data product at a later framecount
to produce CVT(3,2). In this manner, CVT(3,2) depends
on CVT(1,7) and CVT(2,6).
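
The dependency chain just described can be traced with a small amount of bookkeeping over recorded task executions. The sketch below is illustrative only: the record layout, the function name and the framecount values 10 and 12 are hypothetical, wrap-around of the framecount is ignored, and each output is conservatively assumed to depend on all of the task's inputs.

```python
def upstream_cells(executions, target):
    """Return the CVT cells that `target` transitively depends on.

    executions: list of (framecount, task, inputs, output) records, where
    inputs is a list of CVT cells read and output is the CVT cell written.
    """
    depends_on = {}
    for _, _, inputs, output in sorted(executions, key=lambda rec: rec[0]):
        closure = set(inputs)
        for cell in inputs:
            # Inherit whatever the input cell itself depended on.
            closure |= depends_on.get(cell, set())
        depends_on[output] = closure
    return depends_on.get(target, set())


executions = [
    (12, "Ada Task n", [(2, 5)], (3, 2)),
    (10, "Ada Task 1", [(1, 7), (2, 6)], (2, 5)),
]
sources = upstream_cells(executions, (3, 2))
```

When an exception is raised in the task that writes CVT(3,2), the returned cells are the portions of the CVT worth inspecting first.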

The second scenario demonstrates this capability
with a simulation which propagates a message
through an MDM (from BIA to SPD) by utilizing a
CVT memory element. We have developed hardware
and software components for this simulation and will
highlight the software task component in this paper.


Ada Task 1 Current Value Table (CVT) Ada Task 2 



Figure 10. Memory location (3,1) depends upon mem- 
ory locations (1,7) and (1,6) through location (2,5). 


The scenario begins with the power supply status
nominal (#1) and power-in nominal (#2). The power
supply is turned on (#3) to provide power to the rest of
the MDM including the processor (IOCU, BIA +
DRAM) as well as the SPD cards. The BIA is initialized
as an RT (#4,5) followed by the SPD card which is
reset and reinitialized (#6,7). When the MDM framecount
starts (#8), Ada tasks execute when the framecount
is within their nominal range (#9,13). The
synchronized execution of a set of Ada tasks controls
the propagation of information through the MDM. In
the scenario, the information enters the MDM via the
BIA input (from an upper tier computer or from MCC)
(#9). A 1553 Ada task is scheduled to transfer the BIA
value to its CVT. The CVT is given a write command
(#11), with the value to be stored on the write line. Once
the memory latches, the Ada task stops executing and
the value is available on the CVT read line (#12). Two
increments of the framecount later, another
1553 Ada task is executed which copies the CVT read
data to the SPD (#13), ready for propagation to a lower tier
MDM.

We simulate the software at a high level, not modeling
the internals of the software but only the timing
as well as the inputs and outputs of the software. We
assume that any output of a software routine could be
dependent on all inputs. Defining the timing and duration
of each Ada task and the elements of the CVT it
uses requires an analysis of the Ada source code
and its documentation[22,23,24,26,27].



Figure 11. An MDM simulation to propagate data
from BIA_in to SPD_out in 3 framecount increments.

Each Ada task is designed to execute on particular 
framecounts. The framecount is an integer number 
from 0 to 99, where each integer increment is 0.1 sec. 
This is how ISS ensures that Ada tasks do not 
read/write the CVT at the same time. We introduce
metric time into MBR models to model the timed
propagation of information through the C&DH system
by augmenting traditional components from
model-based reasoning with temporal preconditions
(Fig. 12). This allows us to explicitly control the
propagation of information in the structural
dependency models. Instead of components
propagating their data product at every tick of the
diagnosis/simulation engine cycle, propagation only
occurs when the framecount is within nominal range
as defined by the Ada task component (i.e. start_time <
framecount < stop_time).
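
A component with a temporal precondition can be sketched as follows. This is a toy stand-in for the model components, not the L2 modeling language: the class name, the signal names and the particular time windows are hypothetical, chosen so that the message moves from BIA_in to the CVT and, two framecount increments later, from the CVT to SPD_out.

```python
class AdaTaskComponent:
    """Copies one signal to another only inside its framecount window."""

    def __init__(self, name, start_time, stop_time, source, sink):
        self.name = name
        self.start_time = start_time
        self.stop_time = stop_time
        self.source = source
        self.sink = sink

    def step(self, framecount, signals):
        """Propagate source to sink if start_time < framecount < stop_time."""
        in_window = self.start_time < framecount < self.stop_time
        if in_window and self.source in signals:
            signals[self.sink] = signals[self.source]
            return True
        return False


tasks = [
    AdaTaskComponent("bia_to_cvt", 0, 2, "BIA_in", "CVT"),
    AdaTaskComponent("cvt_to_spd", 2, 4, "CVT", "SPD_out"),
]
signals = {"BIA_in": "message"}
trace = []
for framecount in range(5):
    for task in tasks:
        if task.step(framecount, signals):
            trace.append((framecount, task.name))
```

Outside its window a component leaves the signals untouched, so the simulation engine can tick as often as it likes without propagating data early.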



Figure 12. Each Ada task executes only when framecount
is within time window: [start_time, stop_time].

7. Discussion 

We have presented our preliminary work towards 
defining an integrated hardware/software FDIR system 
to augment the ISS C&W system. We have addressed 
unexplained C&DH PRACAs as well as false-positive 
C&W rates, two serious issues in ISS data handling. 
We find that though ISS acknowledges that the state 
space of the software is too large to test at design 
time[6], no solution is provided for runtime software 
V&V. Advances in model-based diagnosis (MBD) to
include software components[11,12,13] provide the
roadmap for integrated hardware and software FDIR.
Traditionally it is easier to model hardware components,
because the physics underlying these models has
been developed over hundreds of years (Kirchhoff’s
laws, Bernoulli’s equations), while paradigms for
modeling software components are still being developed
today (UML).

Scale-up challenges using simulation methods of 
MBD led us to develop synergistic strategies to utilize 
both the TEAMS and L2 tools. TEAMS [17,29,30,32] 
has been used for diagnosis of aerospace vehicles 
including portions of the ISS and C&DH, while
L2[15,16] has flown on the Deep Space 1 spacecraft 
with its single string 1553 bus network. Still, due to 
their coarse abstractions of the underlying structural 
dependency models, in the future we will utilize de- 
tailed simulations using the ISS software emulator 
MADE [23]. 

This work is related to CRANS (Configurable
Realtime Analysis System)[14], a dependency tracking,
real-time monitoring, mission control tool. In CRANS,
structural topologies of the system are encoded in
logical form similar to fault trees. It is useful for
determining upstream causes and downstream effects
of failed ORUs (orbital replacement units). But
CRANS is difficult to configure, does not directly tie
in with the C&W system and does not capture software
dependencies. Future work could address automatic
methods to configure CRANS from domain
knowledge in our tools. Our work also relates to Program
Execution Monitor (PEM) cells[21]: our tools can serve
as a “specialist
software module to assist with detailed analysis” for
state estimation and recovery options, as well as
provide structural dependency models which can be
used to generate PEM cells. Both [21] and [30] share
our goal of augmenting the C&W system. The
challenge all of these approaches face, including the
approach we have presented today, is to ensure that the
systems can easily be configured for the current ISS
stage[9].